This guide helps you access the SLU Basreg database (basregDWH) using R and RStudio.
You will learn how to connect to the database, query its views with dplyr and dbplyr, and download the results into R.
If you are unfamiliar with RStudio, Posit (the maker of RStudio) describes the interface here: https://docs.posit.co/ide/user/ide/guide/code/projects
If you intend to use this documentation for data analysis, please make sure you understand how R projects work: https://docs.posit.co/ide/user/ide/guide/code/projects.html
Important: Users have read-only access. This means you can view and download data, but you are not at risk of changing the database.
You need:
- RStudio. Official getting started guide: https://posit.co/resources/videos/getting-started-with-rstudio/
- Access to SLU Basreg / BasregDWH. For permissions, contact: Eva Rundlöf
- To either be on an SLU campus or use the SLU VPN. How to use the VPN: https://internt.slu.se/en/support-services/administrative-support/it/support/guider-manualer/vpn-anyconnect/
If you have a Windows laptop issued by SLU, ODBC will automatically verify your identity when connecting. Linux and Mac users need to follow separate instructions.
If you want to keep your work organised, you can use Git directly in
RStudio:
https://docs.posit.co/ide/user/ide/guide/tools/version-control.html
Repository:
https://github.com/TKlingstrom/Basreg_introduction
You can use it in two ways:
- Open the .Rproj file in RStudio. Preferably you follow the guide directly in R Markdown. To do this, open basreg_userguide.Rmd with RStudio, or open RStudio and select File → Open Project in New Session.
- You can also create your own R script (in RStudio: File → New File → R Script) and copy the code from the HTML file (or PDF).
The difference between R Markdown and an R script is that in Markdown you can run and rerun your code in “chunks”. Chunks are grey coloured and have a small green play arrow in the upper right corner. The output from running the code is displayed right below the code. R Markdown can also easily be converted into a PDF or HTML file.

I recommend that you follow this guide directly in the R Markdown file or, alternatively, open the HTML file in your browser and then write the code into an R script. In the R Markdown file you can scroll down until you find this text and then begin the tutorial below.

Each code chunk must be run at least once, either by clicking the green play button in the upper right corner of the chunk or by copying the code into your R script, selecting it and hitting CTRL + Enter. You can modify the code and rerun it as you wish. The name to the left of the <- will be the name of the object generated by the code, and that object can be accessed in the Environment tab (top right pane).
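As a minimal illustration of how assignment works (the name below is made up for the example):

```r
# The name to the left of <- becomes an object in your Environment tab
my_numbers <- c(2, 4, 6)

# The object can now be used by name in later code
mean(my_numbers)  # 4
```

Running this in a chunk prints 4 below the code, and my_numbers appears in the Environment tab.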
We use these packages, which must be installed and loaded (run the code below):
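The loading chunk itself is not reproduced in this extract; below is a minimal sketch covering the packages used later in this guide (DBI, odbc, dplyr, dbplyr, ggplot2):

```r
# Install once if any package is missing (uncomment the line below):
# install.packages(c("DBI", "odbc", "dplyr", "dbplyr", "ggplot2"))

library(DBI)     # dbConnect(), dbDisconnect(), dbIsValid()
library(odbc)    # odbc() driver object, odbcListObjects()
library(dplyr)   # filter(), mutate(), select(), joins, the %>% pipe
library(dbplyr)  # translates dplyr verbs into SQL; in_catalog()
library(ggplot2) # plotting in the final steps
```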
This chunk creates a connection object called con. Think of con as the way to tell your RStudio session to connect to the Basreg server. You do not download the whole database to your computer. Instead, you send requests to the server to collect data from SQL views.
A view is a way to present information in an SQL server and contains primarily observation-level data (events, measurements). Due to the way Basreg was developed, it also contains some less useful columns that are necessary for other ways to interact with Basreg.
# Connecting to the database using your R/Windows user credentials
con <- dbConnect(
odbc(),
Driver = "SQL Server",
Server = "basregdwh.db.slu.se",
Database = "basregDWH"
)
# List objects available in the apiSci schema
odbcListObjects(con, catalog = "basregDWH", schema = "apiSci")

The function tbl() creates a reference to the database table/view specified in the parentheses.
Thanks to dbplyr you can filter or manipulate the data before downloading it. In this case:
- head(10) limits the result to a small preview
- collect() downloads those rows into R as a data frame

The %>% (pipe) is how you define the different steps dplyr and dbplyr will take when filtering or manipulating data.
Writing DF.Cow after the pipe then shows the output to you. In the Environment tab (upper right pane in RStudio) you also see all objects you create. Clicking the blue circle before an object shows its datatype, and clicking the table icon far to the right shows you a table with the contents of the data frame. The Connections tab to the right of Environment also shows you all the database schemas you can access; if you have access you can click the blue circle and see what is inside it. The views we will work with are located in basregDWH → apiSci and start with “fact”.
DF.Cow <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleView")) %>%
head(10) %>%
collect()
DF.Cow

Try switching the object name in the code above, for example "factCattleView" to "factCattleReproductionView".

Many Basreg views are sparse.
That means most rows contain only a few filled columns and many
NA.
Filtering helps you select only meaningful rows.
Example:
- !is.na(Calving) means: keep only rows where Calving exists
- DryOff instead gives dry-off events

DF.Reproduction <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(!is.na(Calving)) %>%
collect()
DF.Reproduction

You can combine filters.
In filter(), multiple conditions mean:
condition 1 AND condition 2 AND condition 3
Here we select calving events with a start date in January 2021:
DF.Reproduction <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
!is.na(Calving),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
collect()
DF.Reproduction

Sometimes you want rows that match either one condition or another. In R:
- | means OR
- & means AND (commas also work, as you saw above)

Here we select rows with either a calving or a dry-off event in January 2021:
DF.Reproduction <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
(!is.na(Calving) | !is.na(DryOff)),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
collect()
DF.Reproduction

In Basreg views there are often many columns. The function select() keeps only the columns you specify; everything else is removed from the download.
Here we reuse the same query as in Step 5 (Calving or DryOff events in January 2021), but only collect the most important columns.
DF.Reproduction <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
(!is.na(Calving) | !is.na(DryOff)),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
select(Farm, SE_Number, StartDate, Calving, CalvingEase, DryOff) %>%
collect()
DF.Reproduction

We will now modify the view before downloading it from the server.
In the chunk below we remove the columns Calving, DryOff and StartDate, and with the mutate function instead create:
- CalvingDate, which contains the date of the calving (from StartDate)
- DryOffDate, which contains the date of the dry-off (from StartDate)

This makes the dataset easier to interpret after downloading it.
DF.Reproduction <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
(!is.na(Calving) | !is.na(DryOff)),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
mutate(
# mutate() creates new columns (or modifies existing ones)
# Read more here: https://dplyr.tidyverse.org/reference/mutate.html
CalvingDate = if_else(!is.na(Calving), StartDate, as.Date(NA)),
DryOffDate = if_else(!is.na(DryOff), StartDate, as.Date(NA))
) %>%
select(Farm, SE_Number, CalvingDate, DryOffDate, CalvingEase) %>%
collect()
DF.Reproduction

The reproduction view tells us when a calving happened, but it does not directly show which offspring cow was born from that event.
To connect the calving event to the offspring identity, we will combine two views:
- factCattleReproductionView contains calving information and calving dates
- factCattleView contains the offspring cow identity and the mother’s identity

In factCattleView, the mother’s SE-number is stored in the column Mother.

We will:
- keep Farm, SE_Number, CalvingDate, and CalvingEase from factCattleReproductionView
- keep SE_Number (offspring), Mother, and BirthDate from factCattleView
- join the views on SE_Number (mother) = Mother and CalvingDate = BirthDate
- rename SE_Number into Offspring

# IMPORTANT:
# In this step we do NOT download (collect) the full views into R.
# Instead, dbplyr keeps the objects as "lazy" SQL queries that remain on the Basreg server.
# This means filter(), transmute() and inner_join() are translated into SQL and run server-side.
# Only when we call collect() at the end will the result be downloaded into R.
# Running computations server-side is appropriate when it reduces the amount of data you need to download.
# Running large computations server-side should however be avoided, as it may impact other users.
#
# Read more here:
# - dbplyr basics (lazy queries): https://dbplyr.tidyverse.org/articles/dbplyr.html
# - dbplyr translation to SQL: https://dbplyr.tidyverse.org/articles/translation.html
# 1) Prepare the calving events table (still server-side, NOT collected)
DF.ReproductionEvents <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
!is.na(Calving),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
transmute(
Farm = Farm,
SE_Number = SE_Number, # Mother identity in reproduction view
CalvingDate = StartDate, # Calving date is stored in StartDate
CalvingEase = CalvingEase
)
# 2) Prepare the cattle view with offspring information (still server-side, NOT collected)
DF.Cattle <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleView")) %>%
transmute(
Offspring = SE_Number, # Offspring identity in cattle view
Mother = Mother, # Mother identity in cattle view
BirthDate = BirthDate
)
# 3) Join the two server-side objects in SQL Server and sort from first born to last born
DF.CalvingWithOffspring <- DF.ReproductionEvents %>%
inner_join(
DF.Cattle,
by = c("SE_Number" = "Mother", "CalvingDate" = "BirthDate")
) %>%
arrange(CalvingDate) %>% # arrange() sorts the output by date (oldest first)
select(Farm, SE_Number, Offspring, CalvingDate, CalvingEase) %>%
collect() # collect() runs the SQL query and downloads ONLY the final joined result into R
# 4) List any calving events that did not find an offspring match
DF.UnmatchedCalvings <- DF.ReproductionEvents %>%
anti_join(
DF.Cattle,
by = c("SE_Number" = "Mother", "CalvingDate" = "BirthDate")
) %>%
arrange(SE_Number, CalvingDate) %>%
collect()
DF.CalvingWithOffspring # This table will contain all cases where we now have a Mother and an SE_Number for the calf.
DF.UnmatchedCalvings # This table will contain all cases where we could not find a cow born on the right day to a mother giving birth that day.

We will now use the calving table we created in Step 8 to find milking records after each calving.
For each mother (SE_Number) and each
CalvingDate in the calving table, we select rows from
factCattleMilkingView where the milking date
(StartDate) is within 365 days after the calving
date.
To do this efficiently, we keep the work server-side
and join the views in SQL Server. Only the final result is downloaded
when we use collect().
# IMPORTANT:
# We recreate the calving table as a server-side query (not collected),
# then join it to the milking view on the SQL server.
# This avoids looping in R and avoids downloading large views.
#
# Read more here:
# - dbplyr joins: https://dbplyr.tidyverse.org/articles/two-table.html
# - dbplyr SQL translation: https://dbplyr.tidyverse.org/articles/translation.html
# 1) Calving table (server-side, NOT collected)
DF.CalvingWithOffspring_db <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
!is.na(Calving),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
transmute(
Farm = Farm,
SE_Number = SE_Number, # Mother
CalvingDate = StartDate,
CalvingEase = CalvingEase
) %>%
inner_join(
con %>% tbl(in_catalog("basregDWH", "apiSci", "factCattleView")) %>%
transmute(
Offspring = SE_Number,
Mother = Mother,
BirthDate = BirthDate
),
by = c("SE_Number" = "Mother", "CalvingDate" = "BirthDate")
) %>%
select(Farm, SE_Number, Offspring, CalvingDate, CalvingEase)
# 2) Milking view (server-side, NOT collected)
DF.Milking <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleMilkingView"))
# 3) Join calvings to milking records and keep only milking within 365 days after each calving
DF.MilkingAfterCalving <- DF.CalvingWithOffspring_db %>%
inner_join(DF.Milking, by = "SE_Number") %>% # match milking records to the mother
filter(
StartDate >= CalvingDate,
StartDate < sql("DATEADD(day, 365, CalvingDate)") #This is where the 365 day limit is kept
) %>%
arrange(SE_Number, CalvingDate, StartDate) %>% # sort from earliest to latest
collect() # download only the filtered result to R
DF.MilkingAfterCalving

We will now summarise the milking records we selected in Step 9.
We group the dataset by CalvingEase and calculate:
- the number of milking days per cow (based on StartDate)
- the average daily milk yield per cow (based on TotalYield)

Because cows can be milked several times per day, the table
factCattleMilkingView can contain multiple rows per cow and
day. In that case TotalYield is typically the yield per
milking, not the total daily yield. To get a correct daily yield we
first sum all milkings within the same day for each cow, and then
calculate the average daily yield.
To better describe variation, we also report the median and quartiles (25% and 75%).
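As a quick, self-contained illustration of what median() and quantile() report (the yield values below are made up):

```r
# Seven made-up daily yields (kg) for one group of cows
yields <- c(18, 22, 25, 27, 30, 31, 35)

median(yields)                   # 27, the middle value
quantile(yields, c(0.25, 0.75))  # the 25% and 75% quartiles
```

Note that R's default quantile method interpolates between observations, so the quartiles need not be values that occur in the data.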
# This step uses group_by() and summarise() to calculate statistics.
# Read more here:
# - group_by(): https://dplyr.tidyverse.org/reference/group_by.html
# - summarise(): https://dplyr.tidyverse.org/reference/summarise.html
#
# IMPORTANT:
# In factCattleMilkingView there can be multiple rows per day (multiple milkings).
# TotalYield is then usually the yield per milking, not per day.
# To calculate DAILY yield, we first sum TotalYield within each day for each cow.
# 1) Calculate daily total yield per cow (sums multiple milkings on the same date)
DF.DailyMilkPerCow <- DF.MilkingAfterCalving %>%
group_by(CalvingEase, SE_Number, StartDate) %>%
summarise(
DailyTotalYield = sum(TotalYield, na.rm = TRUE), # sum yield across milkings that day
MilkingsPerDay = n(), # number of milkings that day
.groups = "drop"
)
DF.DailyMilkPerCow

# 2) Summarise per cow (within each CalvingEase group)
DF.MilkingPerCow <- DF.DailyMilkPerCow %>%
group_by(CalvingEase, SE_Number) %>%
summarise(
MilkingDays = n_distinct(StartDate), # number of days with milking records
AvgDailyYieldCow = mean(DailyTotalYield, na.rm = TRUE),# average DAILY yield per cow
.groups = "drop"
)
DF.MilkingPerCow

# 3) Summarise per CalvingEase group (mean, median, quartiles)
DF.SummaryByCalvingEase <- DF.MilkingPerCow %>%
group_by(CalvingEase) %>%
summarise(
Cows = n_distinct(SE_Number),
AvgMilkingDaysPerCow = mean(MilkingDays, na.rm = TRUE),
MedianMilkingDaysPerCow = median(MilkingDays, na.rm = TRUE),
MilkingDays_Q25 = quantile(MilkingDays, 0.25, na.rm = TRUE),
MilkingDays_Q75 = quantile(MilkingDays, 0.75, na.rm = TRUE),
AvgDailyYield = mean(AvgDailyYieldCow, na.rm = TRUE),
MedianDailyYield = median(AvgDailyYieldCow, na.rm = TRUE),
DailyYield_Q25 = quantile(AvgDailyYieldCow, 0.25, na.rm = TRUE),
DailyYield_Q75 = quantile(AvgDailyYieldCow, 0.75, na.rm = TRUE),
.groups = "drop"
)
DF.SummaryByCalvingEase

We will now plot daily milk yield for each cow in the dataset.
Instead of using lines (which can become difficult to interpret when
many cows overlap), we plot each milking day as a dot.
The dots are coloured by CalvingEase.
Note that steps 1-3 are rather repetitive examples of joining, while step 4 is the new step with plotting.
# We plot milk yield by Days in Milk (DIM) instead of calendar date.
# ggplot2 is used for plotting
# Read more here: https://ggplot2.tidyverse.org/
# DIM = number of days since calving for each cow.
# This makes cows comparable even if they calved on different dates.
# IMPORTANT:
# DF.DailyMilkPerCow currently only contains StartDate (milking date), not CalvingDate.
# We therefore re-create a calving table server-side and join it to the daily milk table,
# then calculate DIM in R after collect().
# 1) Calving table (server-side, not collected)
DF.CalvingWithOffspring_db <- con %>%
tbl(in_catalog("basregDWH", "apiSci", "factCattleReproductionView")) %>%
filter(
!is.na(Calving),
StartDate >= as.Date("2021-01-01"),
StartDate <= as.Date("2021-01-31")
) %>%
transmute(
Farm = Farm,
SE_Number = SE_Number,
CalvingDate = StartDate,
CalvingEase = CalvingEase
)
# 2) Daily milk per cow per date (R object from Step 10)
# Make sure StartDate is Date class
DF.DailyMilkPerCow <- DF.DailyMilkPerCow %>%
mutate(StartDate = as.Date(StartDate))
# 3) Join daily milk to calving table and calculate Days in Milk (DIM)
DF.DailyMilkWithDIM <- DF.DailyMilkPerCow %>%
inner_join(
DF.CalvingWithOffspring_db %>% collect(),
by = c("SE_Number", "CalvingEase")
) %>%
mutate(
CalvingDate = as.Date(CalvingDate),
DIM = as.integer(StartDate - CalvingDate) # days since calving
) %>%
filter(DIM >= 0, DIM <= 365) # keep the first 365 days after calving
# 4) Plot dots (easier to interpret than lines when many cows overlap)
ggplot(
DF.DailyMilkWithDIM,
aes(x = DIM, y = DailyTotalYield, colour = CalvingEase)
) +
geom_point(alpha = 0.5, size = 1) +
scale_y_continuous(limits = c(0, NA)) + # Y axis starts at 0
labs(
title = "Daily milk yield after calving (Days in Milk)",
x = "Days in Milk (days since calving)",
y = "Daily milk yield (sum of TotalYield per day)",
colour = "CalvingEase"
) +
theme_minimal()

We will now save the milking dataset to your computer as a CSV file. You can imagine that we will later investigate the appropriate end of the lactation, as we see several animals having started their next lactation in the plot.
After saving, we close the database connection. This is good practice and releases resources on both your computer and the Basreg server.
# Save the milking data as a CSV file in your working directory
# The file will be saved in the folder returned by getwd()
# Read more here: https://www.rdocumentation.org/packages/utils/topics/write.csv
write.csv(DF.MilkingAfterCalving, "DF_MilkingAfterCalving.csv", row.names = FALSE)
# Close the database connection when you are done
# Using dbIsValid() avoids errors if the connection is already closed
# Read more here:
# - dbDisconnect(): https://dbi.r-dbi.org/reference/dbDisconnect.html
# - dbIsValid(): https://dbi.r-dbi.org/reference/dbIsValid.html
if (DBI::dbIsValid(con)) DBI::dbDisconnect(con)